
Layer Normalization

Layer normalization is a technique used in neural networks to normalize the activations of a layer, making training more stable and faster.

Basically, it helps ensure that activations (the inputs to each layer) have a "normal", or at least more consistent, distribution.

It's the same idea as why we normalize our features for regression, classification, or any other model: it keeps the inputs on a consistent scale so that no single feature dominates the model, which helps generalization.

Layer Normalization works by computing the mean and variance of the inputs for each sample, and then applying a normalization transformation on top of them.

LayerNorm vs Generic Normalization

TLDR;

  • For each sample (row), you compute the mean and variance across that sample’s features.
  • Each feature in the row is normalized using the statistics (mean/variance) of that row only.
  • No information from other samples is used.

The main difference with LayerNorm is the input: the NN usually receives batches of data, and each batch is made up of multiple rows (samples).

LayerNorm normalizes all of the features in a sample using the mean and variance of that sample alone; it does NOT use information from other samples in the batch.

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

This basically means we normalize over each row rather than each column, which is also the main difference from BatchNorm.
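
To make the row-vs-column distinction concrete, here is a minimal NumPy sketch (the numbers and variable names are just illustrative) showing which axis each method computes its statistics over:

```python
import numpy as np

# Toy "batch": 3 samples (rows) x 3 features (columns),
# matching the a_ij matrix above.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# LayerNorm statistics: one mean/variance per ROW (per sample).
layer_mean = a.mean(axis=1)    # [2., 5., 8.]
layer_var = a.var(axis=1)

# BatchNorm statistics: one mean/variance per COLUMN (per feature).
batch_mean = a.mean(axis=0)    # [4., 5., 6.]
batch_var = a.var(axis=0)

print(layer_mean, batch_mean)
```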

For an input vector $x = (x_1, x_2, \dots, x_d)$ of length $d$:

$$\mu = \frac{1}{d} \sum_{i=1}^d x_i$$

$$\sigma^2 = \frac{1}{d} \sum_{i=1}^d (x_i - \mu)^2$$

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
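
As a rough sketch, the formulas above can be written directly in NumPy; the function name `layer_norm` and the `eps` default are my own choices here, and real implementations (e.g. `torch.nn.LayerNorm`) additionally apply learnable scale and shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (sample) of x with that row's own mean and variance."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean
    var = x.var(axis=-1, keepdims=True)    # per-sample (biased) variance
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
print(layer_norm(x))
# Each row ends up with ~zero mean and ~unit variance,
# independent of the other rows in the batch.
```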

BatchNorm

TLDR;

  • For each feature (column), you compute the mean and variance across the batch, i.e., across all samples (rows) for that feature.
  • Each feature is normalized using the statistics of that feature across all samples in the batch.
  • No information from other features in the same row is used.

BatchNorm mirrors LayerNorm, except the statistics are computed per feature across the entire batch, which basically means we normalize over the column and not the row.

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

For a batch of $k$ samples, for each feature $j$:

$$\mu_j = \frac{1}{k} \sum_{i=1}^k x_{i,j}$$

$$\sigma_j^2 = \frac{1}{k} \sum_{i=1}^k (x_{i,j} - \mu_j)^2$$

$$\text{BatchNorm}(x_{i,j}) = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$
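
The same kind of sketch for BatchNorm at training time (again, `batch_norm` and `eps` are my own naming; real BatchNorm layers also track running statistics for inference and learn a scale/shift):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each column (feature) of x with that feature's mean and variance over the batch."""
    mu = x.mean(axis=0, keepdims=True)     # per-feature mean over the batch
    var = x.var(axis=0, keepdims=True)     # per-feature (biased) variance over the batch
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
print(batch_norm(x))
# Each column ends up with zero mean and ~unit variance across the batch.
```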

What It Solves

Vanishing / Exploding Gradient

Internal Covariate Shift Problem